Due Jun 7, 2:59 AM EDT
Which notation would you use to denote the 3rd layer’s activations when the input is the 7th example from the 8th minibatch?
Which of these statements about mini-batch gradient descent do you agree with?
Why is the best mini-batch size usually not 1 and not m, but instead something in-between?
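As a reminder of what is being traded off: size 1 gives up vectorization across examples, while size m makes one pass over the entire training set before every single update. A minimal sketch of splitting m examples into mini-batches (the function name and toy data are illustrative, not from the quiz; shuffling is omitted for brevity):

```python
def make_minibatches(xs, ys, batch_size):
    """Split m examples (held here as parallel lists) into mini-batches."""
    assert len(xs) == len(ys)
    return [(xs[k:k + batch_size], ys[k:k + batch_size])
            for k in range(0, len(xs), batch_size)]

xs = list(range(10))            # m = 10 toy examples
ys = [x % 2 for x in xs]
batches = make_minibatches(xs, ys, batch_size=4)
print([len(xb) for xb, _ in batches])   # sizes 4, 4, 2
```

An in-between batch size keeps each update cheap enough to make frequent progress while still letting each batch be processed as one vectorized computation.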
Suppose your learning algorithm’s cost J, plotted as a function of the number of iterations, looks like this:

Which of the following do you agree with?
Suppose the temperature in Casablanca over the first two days of January is the same:
Jan 1st: θ1 = 10°C
Jan 2nd: θ2 = 10°C
(We used Fahrenheit in lecture, so will use Celsius here in honor of the metric world.)
Say you use an exponentially weighted average with β = 0.5 to track the temperature: v0 = 0, vt = βvt−1 + (1−β)θt. If v2 is the value computed after day 2 without bias correction, and v2corrected is the value you compute with bias correction, what are these values? (You might be able to do this without a calculator, but you don't actually need one. Remember what bias correction is doing.)
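A quick way to check your answer is to run the recursion directly. This short sketch uses the quiz's own values (β = 0.5, θ1 = θ2 = 10) and computes both the uncorrected and bias-corrected averages:

```python
# Exponentially weighted average with and without bias correction,
# using beta = 0.5 and theta_1 = theta_2 = 10 (deg C), v_0 = 0.
beta = 0.5
thetas = [10.0, 10.0]

v = 0.0
for theta in thetas:
    v = beta * v + (1 - beta) * theta

v2 = v                              # uncorrected value after day 2
v2_corrected = v / (1 - beta ** 2)  # bias correction with t = 2
print(v2, v2_corrected)             # 7.5 10.0
```

Bias correction divides out the shrinkage caused by initializing v0 = 0, which is why the corrected value recovers the true constant temperature.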
Which of these is NOT a good learning rate decay scheme? Here, t is the epoch number.
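For reference, here is a sketch of decay schemes covered in lecture that do shrink α as the epoch number t grows (α0 and the decay constants are illustrative); a scheme such as α = e^t · α0 would instead grow without bound:

```python
import math

alpha0 = 0.2        # illustrative initial learning rate
decay_rate = 1.0    # illustrative decay constant
k = 0.2             # illustrative constant for the 1/sqrt(t) scheme

def inverse_decay(t):
    """alpha = alpha0 / (1 + decay_rate * t)"""
    return alpha0 / (1 + decay_rate * t)

def exponential_decay(t):
    """alpha = 0.95**t * alpha0"""
    return 0.95 ** t * alpha0

def sqrt_decay(t):
    """alpha = k / sqrt(t) * alpha0"""
    return k / math.sqrt(t) * alpha0

for t in (1, 5, 25):
    print(inverse_decay(t), exponential_decay(t), sqrt_decay(t))
```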
You use an exponentially weighted average on the London temperature dataset. You use the following to track the temperature: vt=βvt−1+(1−β)θt. The red line below was computed using β=0.9. What would happen to your red curve as you vary β? (Check the two that apply)

True, remember that the red line corresponds to β = 0.9. In lecture we had a green line (β = 0.98) that is slightly shifted to the right.
True, remember that the red line corresponds to β = 0.9. In lecture we had a yellow line (β = 0.5) that had a lot of oscillations.
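Both effects can be seen by running the same recursion with two values of β on a simple step in temperature: a larger β adapts more slowly (the curve shifts right), while a smaller β follows the data closely and would also track its noise. A sketch with made-up data:

```python
def ewa(thetas, beta):
    """v_t = beta * v_{t-1} + (1 - beta) * theta_t, starting from v_0 = 0."""
    v, out = 0.0, []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
        out.append(v)
    return out

# Temperature jumps from 10 to 20 at day 10 (illustrative data).
thetas = [10.0] * 10 + [20.0] * 10
slow = ewa(thetas, beta=0.98)  # smooth, lags well behind the jump
fast = ewa(thetas, beta=0.5)   # reacts quickly, would also track noise
print(slow[-1], fast[-1])
```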
Consider this figure:

These plots were generated with gradient descent, with gradient descent with momentum (β = 0.5), and with gradient descent with momentum (β = 0.9). Which curve corresponds to which algorithm?
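For reference, the momentum update from lecture is v = βv + (1−β)dW followed by W ← W − αv; a larger β averages over more past gradients and damps oscillations. A minimal sketch on a 1-D quadratic (the step size, start point, and iteration count are illustrative):

```python
def minimize_with_momentum(beta, alpha=0.1, steps=50):
    """Gradient descent with momentum on J(w) = w**2, whose gradient is 2w."""
    w, v = 5.0, 0.0
    for _ in range(steps):
        grad = 2 * w
        v = beta * v + (1 - beta) * grad  # exponentially weighted gradient
        w = w - alpha * v
    return w

print(minimize_with_momentum(0.0))  # beta = 0 reduces to plain gradient descent
print(minimize_with_momentum(0.5))
print(minimize_with_momentum(0.9))
```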
Suppose batch gradient descent in a deep network is taking excessively long to find a value of the parameters that achieves a small value for the cost function J(W[1], b[1], ..., W[L], b[L]). Which of the following techniques could help find parameter values that attain a small value for J? (Check all that apply)
Which of the following statements about Adam is False?
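For reference, Adam combines a momentum-style average of the gradient with an RMSprop-style average of its square, and bias-corrects both. A minimal 1-D sketch (the hyperparameter values are the common defaults, not prescribed by the quiz):

```python
import math

def adam_step(w, grad, v, s, t, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on a scalar parameter w given its gradient."""
    v = beta1 * v + (1 - beta1) * grad       # first moment (momentum term)
    s = beta2 * s + (1 - beta2) * grad ** 2  # second moment (RMSprop term)
    v_hat = v / (1 - beta1 ** t)             # bias correction
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * v_hat / (math.sqrt(s_hat) + eps)
    return w, v, s

# Minimize J(w) = w**2 (gradient 2w) starting from w = 1.0.
w, v, s = 1.0, 0.0, 0.0
for t in range(1, 2001):
    w, v, s = adam_step(w, 2 * w, v, s, t)
print(w)
```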